Introduction

In this project, we use R and exploratory data analysis techniques to examine a selected data set, moving from single variables to relationships among multiple variables and looking at distributions, outliers, and anomalies. The data set is about red wine quality and contains several chemical properties for each wine. At least 3 wine experts rated the quality of each wine on a scale from 0 (very bad) to 10 (excellent). We want to determine which chemical properties influence the quality of red wines.

Global overview

After loading the data, let's take a global view. The types of the variables and some example values:

## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...

The first 3 observations:

##   id fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1  1           7.4             0.70        0.00            1.9     0.076
## 2  2           7.8             0.88        0.00            2.6     0.098
## 3  3           7.8             0.76        0.04            2.3     0.092
##   free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol
## 1                  11                   34  0.9978 3.51      0.56     9.4
## 2                  25                   67  0.9968 3.20      0.68     9.8
## 3                  15                   54  0.9970 3.26      0.65     9.8
##   quality
## 1       5
## 2       5
## 3       5

A global summary of the statistics:

##        id         fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000

There are 1599 observations with 13 variables. The first one is the id of the observation. All variables are numerical. Some of them seem to have outliers.

Univariate Plots Section

First, let's explore the values of “quality”, our outcome variable:

The variable “quality” has only 6 different discrete values (3, 4, 5, 6, 7, 8), so it is converted to factor type.
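The conversion described above can be sketched as follows. This is a minimal illustration, not the original code: the ten scores are the first values shown in the str() output, and making the factor ordered is an assumption on our part.

```r
# Toy sample: the first ten quality scores from the str() output above.
quality <- c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)

# Declare all six levels found in the full data set (3 to 8) so that
# levels absent from this small sample are still represented.
# ordered = TRUE is an assumption; the report only says "factor".
quality_f <- factor(quality, levels = 3:8, ordered = TRUE)

# Count the observations per quality level.
quality_counts <- table(quality_f)
print(quality_counts)
```

Declaring the levels explicitly keeps empty levels visible in tables and plots, which matters here because the extreme scores are rare.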

Now let's explore the distribution of each of the other variables:

Most of the histograms are skewed right.

Volatile acidity seems to have a core of common values between 0.3 and 0.7. Let's see them as buckets:

Around 25% of observations have a value for citric acid that is lower than 0.1. 50% of observations have a value for citric acid that is between 0.1 and 0.4:

There is a value of 1, probably a measurement error; excluding it, the biggest value is around 0.8.
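The bucket and quartile statements above can be reproduced with cut() and quantile(). The sketch below uses only the first ten values from the str() output, so its numbers will not match the full data set; the break points (0.3 and 0.7) come from the text, while the bucket labels are hypothetical.

```r
# Toy sample: first ten values from the str() output above.
volatile.acidity <- c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5)
citric.acid      <- c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36)

# Bucket volatile acidity into (0, 0.3], (0.3, 0.7] and (0.7, Inf).
# The labels are illustrative names, not taken from the report.
va_bucket <- cut(volatile.acidity,
                 breaks = c(0, 0.3, 0.7, Inf),
                 labels = c("low", "core", "high"))

# Share of observations falling in each bucket.
va_share <- prop.table(table(va_bucket))
print(va_share)

# Quartiles of citric acid: half of the values lie between Q1 and Q3.
citric_quartiles <- quantile(citric.acid, probs = c(0.25, 0.5, 0.75))
print(citric_quartiles)
```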

Free sulfur dioxide seems to have a group of very common values. Let's investigate further with some transformations (log10 and sqrt):

A log transformation of free.sulfur.dioxide reveals a more or less normal distribution. On the other hand, a sqrt transformation reveals that there are three common values around 6, 11 and 16.
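A minimal sketch of the two transformations, assuming base R histograms rather than whatever plotting code produced the original figures. The values are the first ten from the str() output, so the shapes seen on the full data will not appear here.

```r
# Toy sample: first ten values from the str() output above.
free.sulfur.dioxide <- c(11, 25, 15, 17, 11, 13, 15, 15, 9, 17)

# Apply the two transformations discussed in the text.
fsd_log10 <- log10(free.sulfur.dioxide)
fsd_sqrt  <- sqrt(free.sulfur.dioxide)

# plot = FALSE returns the bin counts without drawing anything;
# drop the argument to see the histograms.
h_log  <- hist(fsd_log10, plot = FALSE)
h_sqrt <- hist(fsd_sqrt, plot = FALSE)

print(h_log$counts)
print(h_sqrt$counts)
```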

Also it seems that regular wines (quality 5 or 6) tend to have higher values (14, 15) of free sulfur dioxide:

Now let's create a new variable, bound sulfur dioxide (nonfree.sulfur.dioxide = total.sulfur.dioxide - free.sulfur.dioxide), and compare it with free and total sulfur dioxide:

Bound sulfur dioxide (nonfree.sulfur.dioxide) tends to have slightly higher values than free sulfur dioxide. Total sulfur dioxide seems to be more smoothed along the values.

Now we are going to calculate the percentage of free sulfur dioxide. We call this new variable pfree.sulfur.dioxide:

The percentage of free sulfur dioxide has an almost normal distribution, with a mean around 0.4.
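The two derived variables can be computed as below. This is a sketch on the first three observations shown earlier; storing pfree.sulfur.dioxide as a fraction between 0 and 1 is an assumption, inferred from the reported mean of about 0.4.

```r
# First three observations from the head() output above.
wine <- data.frame(
  free.sulfur.dioxide  = c(11, 25, 15),
  total.sulfur.dioxide = c(34, 67, 54)
)

# Bound sulfur dioxide: whatever is not free.
wine$nonfree.sulfur.dioxide <-
  wine$total.sulfur.dioxide - wine$free.sulfur.dioxide

# Share of free sulfur dioxide in the total (assumed to be a 0-1 fraction).
wine$pfree.sulfur.dioxide <-
  wine$free.sulfur.dioxide / wine$total.sulfur.dioxide

print(wine)
```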

Regarding the alcohol variable, most of the observations have a value between 9 and 12, with a median of about 10:

Now let's compare the distributions for each level of quality. For this, we consider 3 quality groups: bad (4 or lower), regular (5 or 6) and good (7 or higher). We create a new variable (class) indicating the group.

It seems that bad wines have a bigger volatile acidity, and they don’t have high citric acid values. Also they tend to have lower sulphate values. Good wines tend to have more alcohol.
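The grouping into three classes can be sketched with cut(). The thresholds (4 and 6) come from the text; the vector name wine_class is ours (the report calls the variable class), and the scores are the first ten from the str() output.

```r
# Toy sample: the first ten quality scores from the str() output above.
quality <- c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)

# Intervals are right-closed: (-Inf, 4] bad, (4, 6] regular, (6, Inf) good.
wine_class <- cut(quality,
                  breaks = c(-Inf, 4, 6, Inf),
                  labels = c("bad", "regular", "good"))

# Observations per class; "bad" is empty in this small sample.
class_counts <- table(wine_class)
print(class_counts)
```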

Univariate Analysis

What is the structure of your dataset?

The dataset is tidy: it has 1599 observations with 13 variables each. All of the variables are numerical. The first one is an index. The “quality” variable has only 6 discrete values: 3, 4, 5, 6, 7, 8.

What is/are the main feature(s) of interest in your dataset?

Since “quality” is the outcome, the variables “volatile.acidity”, “citric.acid”, “sulphates” and “alcohol” seem to be interesting. The distributions for these variables tend to be different across levels of “quality”.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

The free sulfur dioxide variable might help to predict the outcome. We will investigate this variable, and others derived from it, further.

Did you create any new variables from existing variables in the dataset?

When using the values of the variable “alcohol”, they are rounded to integer values. The outcome “quality” was converted to factor type with 6 levels.

We also created three new variables:

  • nonfree.sulfur.dioxide: the result of subtracting free.sulfur.dioxide from total.sulfur.dioxide
  • pfree.sulfur.dioxide: percentage of free sulfur dioxide
  • class: to group wines in three classes -> bad (qualities 3 and 4), regular (qualities 5 and 6) and good (qualities 7 and 8).

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

Most of the features have outliers that are far beyond the 3rd quartile of their distributions. This is probably one of the reasons why most of them are right skewed.

Volatile acidity seems to have a core of common values between 0.3 and 0.7. Half of the citric acid values are between 0.09 and 0.42. It has an outlier value of 1, probably a measurement error; excluding it, the biggest value is 0.79. There are very few observations with a “quality” value other than 5 or 6.

For free sulfur dioxide we detected a very common value between 5 and 6. A logarithmic transformation gave us a distribution more similar to a normal one. Also we tried to perform a sqrt transformation and we detected common values around 6, 11 and 16. The median values for regular wines (quality 5 or 6) are higher than the median values for other qualities (bad and good, which have similar free sulfur dioxide median values).

Regarding the new variables, bound sulfur dioxide (“nonfree.sulfur.dioxide”) tends to have bigger values than free sulfur dioxide. The percentage of free sulfur dioxide (“pfree.sulfur.dioxide”) has an almost normal distribution, with a mean around 0.4.

Most of the observations have an alcohol value between 9 and 12, with a median of 10. It is strange that wines with a quality of 5 tend to have less alcohol.

As explained above, values of the “alcohol” variable are rounded to integers and the outcome (“quality”) was converted to factor type. There are no other notable changes to the data.

Bivariate Plots Section

We check the Pearson’s correlation between all pairs of variables. We can see it in a graphical way using the psych package:

As suspected, our initial guess about the main features is consistent with the correlation values. The features “volatile.acidity”, “citric.acid”, “sulphates” and “alcohol” show the largest correlations with “quality”, with absolute values ranging from 0.23 to 0.48. In the case of “volatile.acidity”, the correlation is negative.
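The ranking of features by their correlation with quality can be sketched with base R's cor(). This is not the psych-based plot used in the report, only a numerical equivalent. The data frame holds the first ten observations from the str() output, so the coefficients it prints are not the ones quoted above.

```r
# Toy sample: first ten observations from the str() output above.
wine <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Pearson correlation between all pairs of variables.
cor_mat <- cor(wine, method = "pearson")

# Correlations with quality, ordered by absolute strength.
quality_cor <- cor_mat["quality", colnames(cor_mat) != "quality"]
quality_cor <- quality_cor[order(abs(quality_cor), decreasing = TRUE)]
print(round(quality_cor, 2))
```

Ordering by absolute value keeps strong negative correlations, such as the one reported for volatile acidity, near the top of the list.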

Let's examine these variables with some boxplots by quality class:

Box plots corroborate our previous findings. We see a clear positive tendency in all of them (except in volatile.acidity, where it is negative). In the case of alcohol we cannot see a difference between bad and regular wines. The alcohol values for regular wines seem to be very spread out. Let's examine them using the quality values (5 and 6) directly:

There is a jump in the alcohol variable between qualities 5 and 6. Maybe this is a separation between potentially bad wines and potentially good wines.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Based on our previous analysis, we checked some correlations. Box plots for each quality level showed a tendency for these variables: “volatile.acidity”, “citric.acid”, “sulphates” and “alcohol”. All of them except “volatile.acidity” are positively correlated with quality. This is to be expected, because “volatile.acidity” is the amount of acetic acid in wine, which at high levels can lead to an unpleasant, vinegar taste. For a “quality” of 5 the values of “alcohol” are very spread out, although the tendency is that good wines (quality 7 or 8) have the highest median level of alcohol.

Furthermore, correlation matrices gave us a global overview of all pairwise relations, both numerically and graphically.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

According to the correlation matrix, there are some other features with high correlation between them. These correlations are even higher than those commented above (regarding the “quality”). Variable “fixed.acidity” is correlated with “density”, “citric.acid” and “pH”. Of course the correlation is negative for “pH” because a low pH indicates a very acidic environment. This is the reason why “citric.acid” and “pH” are also negatively correlated.

As expected, variable “citric.acid” is negatively correlated with “volatile.acidity” too. This variable is more or less the opposite to “fixed.acidity”.

The negative effect of alcohol for the “density” variable is stronger than the positive effect of “residual.sugar”.

Finally, as expected, the free sulfur dioxide is correlated with the total sulfur dioxide. And of course the new variables “nonfree.sulfur.dioxide” and “pfree.sulfur.dioxide” are related with the previous ones.

What was the strongest relationship you found?

The strongest relationship, ignoring that between “total.sulfur.dioxide” and “nonfree.sulfur.dioxide”, is the negative correlation (-0.68) between “fixed.acidity” and “pH”. As commented before, this is totally normal.

Multivariate Plots Section

Let's examine and compare the combinations of our 4 main features, using colour to show the quality of the wine.

For prediction purposes, we have two main problems: 1) unbalanced observation types (too many regular wines) and 2) the regular wines are very spread out across feature values, so they are mixed with the bad and good classes. Maybe we should try to predict good (or bad) wines instead of classifying into the three classes. Let's check only bad wines against good wines. In this case, we also add some 2D density maps in order to see where the clusters or groups are located for each combination of features:

If we select only “bad” and “good” wines we can see that most good wines have medium values of citric acid and low values of volatile acidity. On the other hand, bad wines usually have medium-high volatile acidity and low citric acid. This is similar for combinations of “volatile.acidity” with “sulphates” or “alcohol”: good wines are upper left and bad wines are lower right. The tendency is similar in “alcohol” vs “citric.acid” or “sulphates”, although in this case good wines are on the upper right and bad wines on the lower left. For the combination “citric.acid” vs “sulphates”, we can see more or less a horizontal line separating good and bad wines:

And finally, let's create 4 simple linear models using our four main features. The first model includes only “alcohol” as predictor. The next model adds “volatile.acidity”. Model 3 also adds “sulphates”, and the last model adds “citric.acid”. We tested different combinations (data not shown) to find this one:

## 
## Calls:
## m1: lm(formula = I(quality_num ~ alcohol), data = data)
## m2: lm(formula = quality_num ~ alcohol + volatile.acidity, data = data)
## m3: lm(formula = quality_num ~ alcohol + volatile.acidity + sulphates, 
##     data = data)
## m4: lm(formula = quality_num ~ alcohol + volatile.acidity + sulphates + 
##     citric.acid, data = data)
## 
## =========================================================
##                      m1        m2        m3        m4    
## ---------------------------------------------------------
## (Intercept)        1.875***  3.095***  2.611***  2.646***
##                   (0.175)   (0.184)   (0.196)   (0.201)  
## alcohol            0.361***  0.314***  0.309***  0.309***
##                   (0.017)   (0.016)   (0.016)   (0.016)  
## volatile.acidity            -1.384*** -1.221*** -1.265***
##                             (0.095)   (0.097)   (0.113)  
## sulphates                              0.679***  0.696***
##                                       (0.101)   (0.103)  
## citric.acid                                     -0.079   
##                                                 (0.104)  
## ---------------------------------------------------------
## R-squared             0.227     0.317     0.336     0.336
## adj. R-squared        0.226     0.316     0.335     0.334
## sigma                 0.710     0.668     0.659     0.659
## F                   468.267   370.379   268.912   201.777
## p                     0.000     0.000     0.000     0.000
## Log-likelihood    -1721.057 -1621.814 -1599.384 -1599.093
## Deviance            805.870   711.796   692.105   691.852
## AIC                3448.114  3251.628  3208.768  3210.186
## BIC                3464.245  3273.136  3235.654  3242.448
## N                  1599      1599      1599      1599    
## =========================================================

We have seen that “alcohol” is the most important feature, followed by “volatile.acidity”. The “sulphates” feature adds a small improvement, but “citric.acid” does not improve the model (something we already suspected thanks to the scatter plots).

Although the models are not very good (R² values are very low, 0.336 in the best case, model 3), predictions are reasonable. If we use rounded predicted quality values then we correctly predict 58% of the qualities. But if we use quality classes (bad, regular and good), then we increase the success rate to 83%:
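The stepwise construction of the four models can be sketched with lm() and update(). This is an illustration under assumptions: it uses only the first ten observations from the str() output (so its coefficients are meaningless), and it compares the models with base R's summary() and AIC() rather than the table function that produced the output above. The column name quality_num follows the calls shown in that table.

```r
# Toy sample: first ten observations from the str() output above.
wine <- data.frame(
  quality_num      = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36)
)

# Nested models: each one adds a single predictor to the previous one.
m1 <- lm(quality_num ~ alcohol, data = wine)
m2 <- update(m1, . ~ . + volatile.acidity)
m3 <- update(m2, . ~ . + sulphates)
m4 <- update(m3, . ~ . + citric.acid)

models <- list(m1 = m1, m2 = m2, m3 = m3, m4 = m4)

# Side-by-side comparison of fit statistics.
comparison <- data.frame(
  r.squared     = sapply(models, function(m) summary(m)$r.squared),
  adj.r.squared = sapply(models, function(m) summary(m)$adj.r.squared),
  AIC           = sapply(models, AIC)
)
print(comparison)
```

R² can never decrease when a predictor is added to a nested model, so adjusted R² and AIC are the more informative columns. In the table above, AIC rises from 3208.768 (m3) to 3210.186 (m4), which agrees with the conclusion that “citric.acid” adds nothing.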

## [1] 0.5822389
## [1] 0.833646

But there is an important thing to note. Our dataset is very unbalanced: 82% of quality values are 5 or 6 (class regular). So a dummy model that always predicts the “regular” class would achieve a success rate of 82%, and our 83% is barely better. If we use quality numbers, almost 43% of the quality values are “5”, so a success rate of 58% is only a small improvement.
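The comparison between a model's success rate and the majority-class baseline can be sketched as follows. The helper names (success_rate, majority_baseline) and the predicted values are hypothetical, made up for illustration; only the ten actual scores come from the str() output.

```r
# Share of predictions that match the actual values.
success_rate <- function(predicted, actual) {
  mean(predicted == actual)
}

# Success rate of a dummy model that always predicts the most common value.
majority_baseline <- function(actual) {
  max(table(actual)) / length(actual)
}

# Actual scores: first ten quality values from the str() output above.
actual <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# HYPOTHETICAL continuous predictions, invented for this example only.
predicted_num <- c(5.2, 5.4, 5.3, 5.6, 5.1, 5.2, 5.0, 5.8, 5.4, 6.1)

model_rate    <- success_rate(round(predicted_num), actual)
baseline_rate <- majority_baseline(actual)

print(c(model = model_rate, baseline = baseline_rate))
```

In this toy case both rates are equal, which illustrates the point made above: on unbalanced data, a success rate only means something when it is read against the majority-class baseline.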

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

Since the outcome “quality” is somewhat subjective, and it is also the median of several evaluators, we thought it better to consider only 3 categories or classes: bad wines (qualities of 4 or lower), regular wines (qualities 5 or 6) and good wines (qualities of 7 and higher). This helps us to see how their features differ.

So we compared our 4 main variables (“volatile.acidity”, “citric.acid”, “sulphates” and “alcohol”) pairwise, taking into account the different “classes” of wine. Here we saw that regular wines are very spread out; most of the time there is no clear boundary between a bad and a regular wine, or between a good and a regular wine. On the other hand, bad wines and good wines are easier to tell apart.

What we have seen is that most good wines have medium values of citric acid and low values of volatile acidity. Bad wines usually have medium-high volatile acidity and low citric acid. This is similar for combinations of “volatile.acidity” with “sulphates” or “alcohol”: good wines are upper left and bad wines are lower right. The tendency is similar in “alcohol” vs “citric.acid” or “sulphates”, although in this case good wines are on the upper right and bad wines on the lower left. For the combination “citric.acid” vs “sulphates”, we can see more or less a horizontal line separating good and bad wines.

Were there any interesting or surprising interactions between features?

Yes. According to our previous bivariate studies, there seems to be a positive correlation between “citric.acid” and “quality”. But if we look at the scatter plots by class of wine (only good and bad), we do not see a clear cutoff in the “citric.acid” feature that distinguishes good from bad wines. The separation is therefore driven by the other variables.

Did you create any models with your dataset? Discuss the strengths and limitations of your model.

Yes, as explained before, we created 4 simple linear models using our four main features. The first model includes only “alcohol” as predictor. The next model adds “volatile.acidity”. Model 3 also adds “sulphates”, and the last model adds “citric.acid”. The R² values of our models are not very good, and the success rates could be a little misleading. One of the main problems is that we have a very unbalanced dataset (too many “regular” wines). Maybe the biggest problem for the model is to distinguish between bad and regular wines, and between good and regular wines.


Final Plots and Summary

Plot One

Description One

This plot shows the densities for the distributions of all features in the dataset. They are grouped according to the three quality classes for wine: bad (4 or lower quality values) in red, regular (5 or 6) in green and good (7 or higher) in blue. Those variables with less overlapping in their density curves could help us to distinguish between quality classes. Four of the best features for this purpose are: volatile acidity, citric acid, sulphates and alcohol. Other variables also could help us to detect a specific class, like fixed acidity (good wines) and % free sulfur dioxide (regular wines).

Note: text, values and ticks of Y-axis were removed for clarity

Plot Two

Description Two

In this case, we analyse the main four features (volatile acidity, citric acid, sulphates and alcohol) with box plots by quality class of wines (bad, regular and good) using the same schema of colours. The box plots also show the mean values as a circle. For 3 of 4 features (all except volatile acidity), median (and mean) values increase with quality; although in the case of alcohol, the classes “bad” and “regular” are very similar, with the exception of some outliers in regular class. In general, the regular class values are very spread. For volatile acidity, we see a negative tendency: values of quality are higher for lower values of this feature.

Note: some outliers were removed (>= 14 for alcohol and >= 1.5 for sulphates)

Plot Three

Description Three

In this plot we show the pairwise comparison for the six combinations of the main four features. Each combination is represented in a scatter plot. We used a subset of the wines dataset, selecting only wines with quality class bad or good. We also removed some outliers (volatile acidity >= 1.5, citric acid >= 1 and sulphates >= 2). The idea is to show that these features could help to distinguish good wines from bad wines. We omit regular wines because their features are so spread out that it is not easy to make a distinction; besides, a person is usually not interested in detecting a regular wine, but in detecting a potentially good wine or avoiding a bad one.

These scatter plots also show density 2D maps for each class. This allows us to see regions or clusters of good wine and bad wine.


Reflection

We analysed a red wine dataset with 1,599 observations and 12 features. One of these features is the score, or quality, of the wine. The objective was to analyse the other features to learn their influence on wine quality. After studying the distributions of the features, taking the qualities into account, we identified four of the features as the most influential: volatile acidity, citric acid, sulphates and alcohol. After grouping the qualities into three classes (bad, regular and good), we saw that there was a correlation with the main features. This correlation is positive in all cases except for volatile acidity, whose correlation is negative. Multivariate analysis allowed us to see that combinations of the main features could help to determine different “spatial” regions for good wines and bad wines. We decided that predicting regular wines does not make much sense: most people want to detect a potentially good wine (or avoid a bad one).

According to our study, good wines seem to have lower volatile acidity, higher alcohol and medium-high sulphate values. Bad wines tend to have low values for citric acid, although, as we have seen, this feature does not improve our predictive models.

Regarding these predictive models, we tried a simple linear model with only one main feature, and then added the other 3 main features one by one. Although the R² is small, the success rates are fairly high. But this is mainly because we have a problem of unbalanced data: too many “regular” class observations.

In future work, we should try to improve our modelling procedures by balancing the data and using cross-validation techniques to detect overfitting. We could also try an algorithm for feature selection.

Other machine learning algorithms could work better for this problem. Decision trees could be useful to detect a path of rules to determine wine quality. Also classification algorithms could be used since quality is in fact an ordered categorical variable. There are more powerful methods, like random forest or Support Vector Machines (SVM); they could help us to get good predictors, but it would be more complicated to interpret the resulting models. k-Nearest Neighbours algorithm (k-NN) could work very well in this context, but it will not explain anything about the underlying model.